feat: support UUIDs to pyarrow on more backends #8901

Closed
wants to merge 1 commit into from

Conversation

NickCrews
Contributor

@NickCrews NickCrews commented Apr 5, 2024

partially fixes #8902.

Implements UUID execution to pyarrow on some backends, and adds notimpl tests for the rest.

@NickCrews NickCrews force-pushed the uuid-to-pyarrow branch 5 times, most recently from 2d5361f to bb2087d Compare May 11, 2024 04:18
@NickCrews NickCrews changed the title test: add tests for executing UUIDs to pyarrow feat: support UUIDs to pyarrow on more backends May 11, 2024
@NickCrews NickCrews force-pushed the uuid-to-pyarrow branch 3 times, most recently from accada4 to b853a19 Compare May 11, 2024 07:19
@cpcloud cpcloud added this to the 9.2 milestone Jun 13, 2024
@cpcloud cpcloud requested a review from jcrist June 13, 2024 12:30
Member

@cpcloud cpcloud left a comment

We need to avoid mixing pyarrow and pandas conversion paths.

ibis/formats/pyarrow.py (resolved)
@NickCrews
Contributor Author

OK, I think this brings up a larger philosophical question: Do we want to totally separate the pandas and pyarrow codepaths, or can they rely on each other?

Currently, to get pyarrow results from a backend:

  1. For some backends, we go straight from the DB cursor object to pyarrow arrays, never needing pandas.
  2. In the backends that I touch in this PR, we go through the path of db_cursor -> pandas -> pyarrow.

I think the coupling between pandas and pyarrow for this conversion isn't inherently bad (it means we don't need to implement the db -> pyarrow path!), but I agree that it should be isolated, so that it is very clear where we are mixing these two ecosystems. That way, for the backends that don't need it, you can have just pyarrow installed without needing pandas.
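To illustrate the isolation idea, here is a minimal sketch of importing an optional dependency only at the point where it is actually needed. The helper name `load_optional` and its error message are hypothetical and are not part of ibis; the sketch only shows the general pattern of keeping an optional import behind a single, clearly marked boundary.

```python
import importlib


def load_optional(module_name, feature):
    """Import an optional dependency only when the feature that needs it is used.

    Hypothetical helper for illustration; not an ibis API.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with a message that names the feature that needed the module.
        raise ImportError(
            f"{feature} requires the optional dependency {module_name!r}"
        ) from exc


# A bridge module would call this inside the function that needs pandas,
# e.g. pd = load_optional("pandas", "the pandas -> pyarrow bridge"),
# so that merely importing the bridge module never imports pandas.
json_module = load_optional("json", "demo feature")
```

With this pattern, a backend that converts directly to pyarrow never triggers the pandas import at all.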

So I see two options:

  1. Keep this db_cursor -> pandas -> pyarrow path, but sequester it into a third module that is external to both ibis/formats/pandas.py and ibis/formats/pyarrow.py.
  2. In the backends that don't have it yet, implement the db_cursor -> pyarrow conversion directly.

I think I would lean towards 2. I want to remove reliance on pandas as much as possible. Possibly this implementation won't be that hard for these other backends.

@cpcloud
Member

cpcloud commented Jul 1, 2024

I think we'd like to eventually be able to offer Ibis without requiring pyarrow or pandas, or at least without requiring pandas. Many systems are starting to have arrow-native endpoints that don't involve pandas, so db -> pyarrow is actually better for those cases.

There's also the potential of using something that doesn't depend on either of those for the core (like printing tables), so I think we'd like to keep things as isolated from one another as possible.

Even more important is the fact that sending anything through pandas is likely to result in some kind of type or value alteration that doesn't happen with pyarrow. Especially with NULLs, pandas is likely to do something completely different and incompatible with what pyarrow would do.

@NickCrews
Contributor Author

Ok, when I get back to this I'll try the db -> arrow method!

@cpcloud cpcloud removed this from the 9.2 milestone Jul 16, 2024
@cpcloud
Member

cpcloud commented Jul 23, 2024

Is this PR still viable?

@NickCrews
Contributor Author

Viable, I just stopped needing it personally, so the urgency of it dropped a lot compared to the 5 million other PRs I have open haha. Feel free to close if you want, and re-open once someone actually finds time to work on it.

Development

Successfully merging this pull request may close these issues.

bug: postgres.to_pyarrow(ibis.uuid()) errors
2 participants